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Abstract 

We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the 
row-updates model. The algorithm is presented an arbitrary input matrix A £ R nxd one row at a time. 
It performed 0(d£) operations per row and maintains a sketch matrix B £ Br X(i such that for any k < £ 

\\A T A - B T B\\ 2 < \\A - A k \\ * 2 F /(£ - k) and ||A — Tr Bk {A)\\' 2 F < (l + —)||A — A k \\ 2 F . 

Here, A *. stands for the minimizer of j| A — A^Wp over all rank k matrices (similarly B *.) and 7tg fe (A) is 
the rank k matrix resulting from projecting A on the row span of Bi ^\ 

We show both of these bounds are the best possible for the space allowed. The summary is mergeable, 
and hence trivially parallelizable. Moreover, Frequent Directions outperforms exemplar implementations 
of existing streaming algorithms in the space-error tradeoff]^ 


1 Introduction 


The data streaming paradigm [28,45] considers computation on a large data set A where data items arrive 


in arbitrary order, are processed, and then never seen again. It also enforces that only a small amount 
of memory is available at any given time. This small space constraint is critical when the full data set 
cannot fit in memory or disk. Typically, the amount of space required is traded off with the accuracy of 
the computation on A. Usually the computation results in some summary S( A) of A, and this trade-off 
determines how accurate one can be with the available space resources. 

Modern large data sets are often viewed as large matrices. For example, textual data in the bag-of-words 
model is represented by a matrix whose rows correspond to documents. In large scale image analysis, each 
row in the matrix corresponds to one image and contains either pixel values or other derived feature values. 
Other large scale machine learning systems generate such matrices by converting each example into a list of 
numeric features. Low rank approximations for such matrices are used in common data mining tasks such as 
Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), and k-means clustering. Regardless 
of the data source, the optimal low rank approximation for any matrix is obtained by its truncated Singular 
Value Decompositions (SVD). 


* Thanks to support by NSF CCF-1350888, IIS-1251019, and ACI-1443046. 

^Supported by the XDATA program of DARPA administered through Air Force Research Laboratory contract FA8750-12- 
C0323. 

’The matrix Ao is defined to be an all zeros matrix of the appropriate dimensions. 

2 This paper combines, extends and simplifies the results in [39] |]27 | and |56|. 
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In large matrices as above, one processor (and memory) is often incapable of handling all of the dataset 
A in a feasible amount of time. Even reading a terabyte of data on a single processor can take many hours. 
Thus this computation is often spread among some set of machines. This renders standard SVD algorithms 
infeasible. Given a very large matrix A, a common approach is to compute in the streaming paradigm a 
sketch matrix B that is significantly smaller than the original. A good sketch matrix B is such that Am B 
or A T A « B T B and so computations can be performed on B rather than on A without much loss in 
precision. 


Prior to this work, there are three main matrix sketching approaches, presented here in an arbitrary or¬ 
der. The first generates a sparser version of the matrix. Sparser matrices are stored more efficiently and 
can be multiplied faster by other matrices ||4][7]|23j. The second approach is to randomly combine matrix 
rows [40[47]5T]52j. The proofs for these rely on subspace embedding techniques and strong concentration 
of measure phenomena. The above methods will be collectively referred to as random-projection in the 
experimental section. A recent result along these lines 1141, gives simple and efficient subspace embeddings 
that can be applied in time 0{nnz{A)) for any matrix A. We will refer to this result as hashing in the exper¬ 
imental section. While our algorithm requires more computation than hashing, it will produce more accurate 
estimates given a fixed sketch size. The third sketching approach is to find a small subset of matrix rows 
(or columns) that approximate the entire matrix. This problem is known as the ‘Column Subset Selection 
Problem’ and has been thoroughly investigated Ql0jT8j[l9j22j25j. Recent results offer algorithms with 
almost matching lower bounds @|T3}[h8j. A simple streaming solution to the ‘Column Subset Selection 
Problem’ is obtained by sampling rows from the input matrix with probability proportional to their squared 
£2 norm. Despite this algorithm’s apparent simplicity, providing tight bounds for its performance required 
over a decade of research [6] [19} [22 25 46j49j[53| . We will refer to this algorithm as sampling. Algorithms 
such as CUR utilize the leverage scores of the rows [21 ] and not their squared 1 2 norms. The discussion on 
matrix leverage scores goes beyond the scope of this paper, see [1221 for more information and references. 


In this paper, we propose a fourth approach: frequent directions. It is deterministic and draws on the 
similarity between the matrix sketching problem and the item frequency estimation problem. We provide 
additive and relative error bounds for it; we show how to merge summaries computed in parallel on disjoint 
subsets of data; we show it achieves the optimal tradeoff between space and accuracy, up to constant factors, 
for any row-update based summary; and we empirically demonstrate that it outperforms exemplars from all 
of the above described approaches. 


1.1 Notations and preliminaries 

Throughout this manuscript, for vectors, || • || will denote the Euclidian norm of a vectors | x | = y^A xf. 
For matrices || ■ || will denote the operator (or spectral) norm |.41 = supM a .ifa 1 | Ax\\. Unless otherwise 
stated, vectors are assumed to be column vectors. The notation a* denotes the ith row of the matrix A. The 
Frobenius norm of a matrix A is defined as || A\\p = y/^A =1 l! a *ll 2 - It refers to the l x £ identity matrix. 

The singular value decomposition of matrix B e M^ xd is denoted by [U, E, V] = svd(i?). If £ < d 
it guarantees that B = UEU T , U T U = h, V T V = I d , U <E R <x/ , V € M dx£ , and E 6 R txt is a 
non-negative diagonal matrix such that Eqi > £ 2,2 > ... > E^ > 0. It is convenient to denote by 
Uk, and 14 the matrices containing the first k columns of U and V and E/. £ M. kxk the top left k x k 
block of E. The matrix .4/,. = f/^E^Vj 7 is the best rank k approximation of A in the sense that Aj, = 
arg min C:rank ( C ) <fe ||A — C\\ 2 ,f- Finally we denote by ttb(A) the projection of the rows A on the span of the 
rows of B. In other words, ttb(A) = /I B' B where (-p indicates taking the Moore-Penrose psuedoinverse. 
Alternatively, setting [U, E, V] = svd (B), we also have 7 tb(A) = AV r V. Finally, we denote tt^(A) = 
A l ' /; r V}. , the right projection of A on the top k right singular vectors of B. 
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1.2 Item Frequency Approximation 


Our algorithm for sketching matrices is an extension of a well known algorithm for approximating item 
frequencies in streams. The following section shortly overviews the frequency approximation problem. 
Here, a stream A = { a i, • • • , a n } has n elements where each a, 6 [<Z]. Let fj = |{a, E A \ a t = j}\ be 
the frequency of item j and stands for number of times item j appeal's in the stream. It is trivial to produce 
all item frequencies using 0(d) space simply by keeping a counter for each item. Although this method 
computes exact frequencies, it uses space linear to the size of domain which might be huge. Therefore, we 
are interested in using less space and producing approximate frequencies fj. 

This problem received an incredibly simple and elegant solution by Misra and Gries 1441. Their algo¬ 
rithm [44|] employes a map of i < d items to i counters. It maintains the invariant that at least one of the 
items is mapped to counter of value zero. The algorithm counts items in the trivial way. If it encounters an 
item for which is has a counter, that counter is increased. Else, it replaces one of the items mapping to zero 
value with the new item (setting the counter to one). This is continued until the invariant is violated, that is, 
£ items map to counters of value at least 1. At this point, all counts are decreased by the same amount until 
at least one item maps to a zero value. The final values in the map give approximate frequencies fj such that 
0 < fj — fj < n/£ for all j E [d}\ unmapped j imply fj = 0 and provides the same bounds. The reason 
for this is simple, since we decrease £ counters simultaneously, we cannot do this more that n/i times. And 
since we decrement different counters, each item’s counter is decremented at most n/i times. Variants of 
this very simple (and very clever) algorithm were independently discovered several times [17 29 35 431 
From this point on, we refer to these collectively as FrequentItems. 

Eater, Berinde et al. [8j proved a tighter bound for FrequentItems. Consider summing up the errors by 
Rk = Y!i=k+1 | fj — fj\ and assume without loss of generality that fj > fj + \ for all j. Then, it is obvious 
that counting only the top k items exactly is the best possible strategy if only k counters are allowed. That 
is, the optimal solution has a cost of Rk = Yli=k +1 fj- Berinde et al. [8] showed that if FrequentItems 
uses £ > k counters then | fj — fj\ < Rk/(£ — k). By summing the error over the top k items it is easy to 
obtain that Rk < j^Rk- Setting!? = \k + k/e\ yields the convenient form of Ilk < (1 +e)/(/,.. The authors 
also show that to get this kind of guarantee in the streaming setting ()(/;:/e) bits are indeed necessary. This 
make FrequentItems optimal up to a log factor in that regard. 


1.3 Connection to Matrix Sketching 

There is a tight connection between the matrix sketching problem and the frequent items problem. Let A be 
a matrix that is given to the algorithm as a stream of its rows. For now, let us constrain the rows of A to be 
indicator vectors. In other words, we have a* E {ei,..., (Q}, where e 3 is the jth standard basis vector. Note 
that such a matrix can encode a stream of items (as above). If the ith element in the stream is j, then the ith 
row of the matrix is set to «, = e 3 . The frequency fj can be expressed as fj = | Ae 3 11 2 . Assume that we 
construct a matrix B E W xd as follows. First, we run FrequentItems on the input. Then, for every item 
j for which fj > 0 we generate one row in B equal to f ■ 7 ■ e 3 . The result is a low rank approximation of A. 
Note that rank(H) = £ and that 1173ej 11 2 = fj. Notice also that ||A||p = n and that A T A = diag(/i,..., fd ) 
and that B T B = diag(/i,..., fd). Porting the results we obtained from FrequentItems we get that 
||A t A — B 1 B ||2 = maxj \ fj — fj\ < \\A\\ 2 F /(£ — k). Moreover, since the rows of A (corresponding 
to different counters) are orthogonal, the best rank k approximation of A would capture exactly the most 
frequent items. Therefore, \\A — Ak\\ 2 F = Rk = Yli=k +1 fj- ^ we follow the step above we can also reach 
the conclusion that ||A — Tr^fA)|| ^ < f 1 ; . 11,4 — ,4/,|| p. We observe that, for the case of matrices whose 
rows are basis vectors, FrequentItems actually provides a very efficient low rank approximation result. 

3 The reader is referred to 1351 for an efficient streaming implementation. 
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In this paper, we argue that an algorithm in the same spirit as FrequentItems obtains the same bounds 
for general matrices and general test vectors. 


1.4 Main results 

We describe the FrequentDirections algorithm, an extension of FrequentItems to general matrices. 
The intuition behind FrequentDirections is surprisingly similar to that of FrequentItems: In the 
same way that FrequentItems periodically deletes £ different elements, FrequentDirections peri¬ 
odically ‘shrinks’ £ orthogonal vectors by roughly the same amount. This means that during shrinking steps, 
the squared Frobenius norm of the sketch reduces t times faster than its squared projection on any single 
direction. Since the Frobenius norm of the final sketch is non negative, we are guaranteed that no direction 
in space is reduced by “too much”. This intuition is made exact below. As a remark, when presented with 
an item indicator matrix, FrequentDirections exactly mimics a variant of FrequentItems. 

Theorem 1.1. Given any matrix A E M" x d , FREQUENTDIRECTIONS processes the rows of A one by one 
and produces a sketch matrix B E M^ x , such that for any unit vector x E 

0 < ||Ax|| 2 - \\Bxf < \\A - A k |||7(£ - k). 


This holds also for all k < £ including k = 0 where we define Aq as the n x d all zeros matrix. 

Other convenient formulations of the above are A T A A B T B A A r A — Id ■ \\A — A k \\ 2 F /{£ — k) or 
A T A — B t B A 0 and || A T A — B T B || 2 < || A — A k \\ 2 F /{£ — k). Note that setting l = \k + l/e\ yields error 
of e|| A — ,4/.|||, using Oidk + d/e) space. This gives an additive approximation result which extends and 
refines that of Liberty |39|. We also provide a multiplicative error bound from Ghashami and Phillips [27]. 


Theorem 1.2. Let B E be the sketch produced by FrequentDirections. For any k < £ it holds 
that 

||A — ir F (A)\\f < (1 + j-)m — A k \\ f- 

Note that by setting l = \k + k/e\ one gets the standard form \\A — tt f (A) |||, < (1 + e)|| A — A k \\ 2 F . Fre¬ 
quentDirections achieves this bound while using less space than any other known algorithm, namely 
0(£d) = Ofkd/e) floating point numbers, and is deterministic. Also note that the 7 t f operator projects onto 
a rank k subspace £>/., where as other similar bounds [9irT3l|24]l42l|5T| project onto a higher rank subspace 


B, and then considers the best rank k approximation of tt f (A). Our lower bound in Theorem 1.4 from 
Woodruff (561, shows our approach is also tight even for this weaker error bound. 


Theorem 1.3. Assuming a constant word size, any matrix approximation algorithm, which guarantees 
||A T A — B T B ||2 < || A — A k \\ 2 F /{£ — k) must use Ll(d£) space; or in more standard terms, guarantees of 
11 AAA — B t B \\2 < e\\A — Afc||p must use Ll(dk + d/e) space. 


Theorem 1.4. Assuming a constant word size, any randomized matrix approximation streaming algorithm 
in the row-wise-updates model, which guarantees ||A — 7r^(A)||^ < {1 -\- e)\\A — Ak\\ 2 F and that succeeds 
with probability at least 2/3, must use Ll(kd/e) space. 


Theorem [L4] claims that FrequentDirections is optimal with respect to the tradeoff between sketch 
size and resulting accuracy. On the other hand, in terms of running time, FrequentDirections is not 
known to be optimal. 


Theorem 1.5. The running time of FREQUENTDIRECTIONS on input A E W nxd and parameter £ is 
0(nd£). FREQUENTDIRECTIONS is also “embarrassingly parallel” because its resulting sketch matrices 
constitute mergeable summaries [j5J. 
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1.5 Practical Implications 

Before we describe the algorithm itself, we point out how it can be used in practice. As its name suggests, 
the FrequentItems algorithm is often used to uncover frequent items in an item stream. Namely, if one 
sets £ > 1/e, then any item that appears more than en times in the stream must appear in the final sketch. 
Similarly, FrequentDirections can be used to discover the space of unit vectors (directions) in space x 
for which \\Ax\\ 2 > e\\A\\ 2 F , for example. This property makes FrequentDirections extremely useful 
in practice. In data mining, it is common to represent data matrices A by their low rank approximations Ay.. 
But, choosing the right k for which this representation is useful is not straight forward. If the chosen value 
of k is too small the representation might be poor. If k is too large, precious space and computation cycles 
are squandered. The goal is therefore to pick the minimal k which provides an acceptable approximation. To 
do this, as practitioners, we typically compute the top k' 3> k singular vectors and values of A (computing 
svd (A) partially). We then keep only the k singular vectors whose corresponding singular values arc larger 
than some threshold value t. In other words, we only “care about” unit vectors x such that L4.x| > t. 
Using FrequentDirections we can invert this process. We can prescribe in advance the acceptable 
approximation t and directly find a space containing those vectors x for which | Ax | > t. Thereby, not only 
enabling a one-pass or streaming application, but also circumventing the svd computation altogether. 


2 Frequent Directions 

The algorithm keeps an l x d sketch matrix B that is updated every time a new row from the input matrix A 
is added. The algorithm maintains the invariant that the last row of the sketch B is always all-zero valued. 
During the execution of the algorithm, rows from A simply replace the all-zero valued row in B. Then, the 
last row is nullified by a two-stage process. First, the sketch is rotated (from the left) using its SVD such 
that its rows are orthogonal and in descending magnitude order. Then, the sketch rows norms are “shrunk” 
so that at least one of them is set to zero. 


Algorithm 2.1 FrequentDirections 

Input: £,Ae R nxd 


B <- 0 £xd 


for i e 1, ..., n do 


Be <— a t 

# ith row of A replaced (all-zeros) (th row of B 

[U,E,V] <- svd (B) 


c ^ SU T 

# Not computed, only needed for proof notation 

6 ^ a j 


B <- VS 2 - bh • V T 

# The last row of B is again zero 

return B 



2.1 Error Bound 

This section proves our main results for Algorithm |2. 1 1 which is our simplest and most space efficient algo¬ 
rithm. The reader will notice that we occasionally use inequalities instead of equalities at different parts of 
the proof to obtain three Properties. This is not unintentional. The reason is that we want the same exact 
proofs to hold also for Algorithm |3.1| which we describe in Section[3] Algorithm |3.1| is conceptually identical 
to Algorithm |2.1[ it requires twice as much space but is far more efficient. Moreover, any algorithm which 
produces an approximate matrix B which satisfies the following facts (for any choice of A) will achieve the 
eiTor bounds stated in Lemma lrTl and Lemma [L2l 
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In what follows, we denote by 8i, B^, Cy the values of <5, B and C respectively after the /th row of A 
was processed. Let A = Ya =i be the total mass we subtract from the stream during the algorithm. To 
prove our result we first prove three auxiliary properties. 

Property 1. For any vector x we have ||Ac|| 2 — ||-Bx|[ 2 > 0. 

Proof. Use the observations that (ai,x ) 2 + ||Sr i _ 1 ia;|| 2 = ||C'[j]x|| 2 . 

n n 

||Ar|| 2 - ||5x|| 2 = ^[(a*,x) 2 + ||5[i_ip|| 2 - H^pf] > ^[HC^pf - ||S w x|| 2 ] >0. if 

i=l i =1 

Property 2. For any unit vector x G we have ||Ac|| 2 — ||i?x|| 2 < A. 

Proof To see this, first note that ||C[jp|[ 2 — ||i?[jp|| 2 < HC^Cp — < 5i. Now, consider the fact 

that ||C[jp|| 2 = ||5[j_ip|| 2 + ||ajX’|| 2 . Substituting for ||C[jp|| 2 above and taking the sum yields 

ll^pf - \\B[{\x \\ 2 = ^(WB^xf + ||aix|| 2 ) - WB^xf 
i i 

= ||Ac|| 2 + ||i?[o]x|| 2 — ||5[ n ]x|| 2 = ||^4x|| 2 — ||.Bx|| 2 . 

Combining this with ||C[j]x|| 2 — \\B[ i] x \\ 2 < Si = A yields that || Ac|| 2 — ||i?x|| 2 < A. □ 

Property 3. A£ < \\A\\ 2 F - \\B\\ 2 F . 

Proof In the ith round of the algorithm |C [ 4 ]\\' F > \\B[i]\\ 2 F + £Si and ||q i] |||, = ||%_i] ||| + |h|| 2 . By 
solving for ||aj|| 2 and summing over i we get 

n n 

Mil! = ^2 H a *ll 2 — ^2 M[i] III ~ Il-B[i-1] III + = ||-B||| + □ 

i= 1 i =1 


Equipped with t he a bove observations, and no additional requirements about the construction of B, we 
can prove Lemma E3 Namely, that for any k < l, ||2l r yl — B T B ||2 < ||A — A k \\ F /(£ — k ). We use 
Property [2] verbatim and bootstrap Property[3]to prove a tighter bound on A. In the following, y, correspond 
to the singular vectors of A ordered with respect to a decreasing corresponding singular values. 


Ai? < mui' — ii-Biii 

= '^2\\Ayi\f + ^2 l|Ayi|| 2 - ||-B||| 

i=1 i=k-\-1 

k 

= "22 ll^y*ll“ + M — -Afclll — ll-B||| 

i=1 

k 

< ||^4 — A k Hf’ + ^2 i\\ A Vi\\ 2 ~ ll-^J/ilh) 
2=1 

S ||A — + kA. 


via Property [3] 

= ^2\\ A vi\\ 2 

2=1 


k 

x;ii 5 W ii 2< iis 

i=l 


2 

F 


via Property [2] 


Solving A£ < ||j4 — A k \\^ + kA for A to obtain A < \\A — A k \\ 2 F /{£ — fc), which combined with Property [I] 
and Property |2]proves Theorem |1.1[ for any unit vector x we have 


0 < Px|| 2 - \\Bx\\ 2 < A < \\A - A k \\ 2 f /(£ - k). 
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Now we can show that projecting A onto B/, provides a relative error approximation. Here, y, correspond 
to the singular vectors of A as above and v t to the singular vectors of B in a similar fashion. 

k 

m — ir%{A)\\p = ||-A|||i — ||7r|j(^4)||^ = ||A||J. - ^ 11^*11' 

2=1 
k 

2—1 
k 

< PIIf - E lisrf 

2=1 
k 

< M|& - EtlPrf - A ) 

2=1 

= \\Af F - \\A k f F + kA 

< m+ -——ha _ ^Hf 

= j^i w A ~ Ak ^- 

This concludes the proof of Theorem |1.2[ It is convenient to set £ = \k + k/e\ which results in the standard 
bound form ||A - ir^(A)\\ 2 F < (1 + e)||A - A k \\ 2 F . 

3 Running Time Analysis 

Each iteration of Algorithm |2. llis dominated by the computation of the svd(/i). The standard running time 
of this operation is 0(d£ 2 ) [30]. Since this loop is executed once per row in A the total running time would 
naively be ()(ndt: 2 ). However, note that the sketch matrix B actually has a very special form. The first /: — 1 
rows of B are always orthogonal to one another. This is a result of the sketch having been computed by 
an svd in the previous iteration. Computing the svd of this matrix is possible in 0(d£) time using the Gu- 
Eisenstat procedure pT| . This requires using the Fast Multiple Method (FMM) and efficient multiplication 
of Cauchy matrices by vectors which is, unfortunately, far from being straight forward. It would have been 
convenient to use a standard svd implementation and still avoid the quadratic term in i in the running time. 
We show below that this is indeed possible at the expense of doubling the space used by the algorithm. 
Algorithm |3. 1 1 gives the details. 

Algorithm 3.1 Fast-FrequentDirections 
I nput: f,4e R nxd 
B all zeros matrix E R 2ixd 
for i E 1,..., n do 

Insert a* into a zero valued row of B 
if B has no zero valued rows then 
[U, E, V] <- svd(H) 

C = EC T 
6 E- erf 

B «— y/ max(E 2 — 1^5, 0) • V T 

return B 


# Only needed for proof notation 
# The last £ + 1 rows of B are zero valued. 


Pythagorean theorem 


via Property [T] 

j j 

since \\Bvi ii 2 > E ii^ii 2 
2=1 

via Property [2] 


2=1 


by A < \\A- A k \\ F /(£- k) 
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Note that in Algorithm |3.1| the svd of B is computed only nj (f+1) times because the “if’ statement is only 
triggered once every l + 1 iterations. Thereby exhibiting a total running time of (){{n/t)d(i 2 ) = 0{nd£). 
The reader should revisit the proofs in Section |2.1| and observe that they still hold. Consider the values 
of i for which the “if” statement is triggered. It still holds that o a C'gC'jj] - A 5I d and that 

||C[j] Hi’ — |£>[,;] ||| > 16. For the other values of i. the sketch simply aggregates the input rows and there is 
clearly no incurred error in doing that. This is sufficient for the same analysis to go through and complete 
our discussion on the correctness of Algorithm [3TT] 


3.1 Parallelization and Merging Sketches 

In ex tic me ly large datasets, the processing is often distributed among several machines. Each machine 
receives a disjoint input of raw data and is tasked with creating a small space summary. Then to get a global 
summary of the entire data, these summaries need to be combined. The core problem is illustrated in the case 
of just two machines, each process a data set A\ and .4 2 , where A = [A\: Ao], and create two summaries B\ 
and Bo, respectively. Then the goal is to create a single summary B which approximates A using only B\ 
and B' 2 . If B can achieve the same formal space/error tradeoff as each B\ to A\ in a streaming algorithm, 
then the summary is called a mergeable summary (5 j. 

Here we show that the FrequentDirections sketch is indeed mergeable under the following proce¬ 
dure. Consider B' = [Bi] B 2 ] which has 21 rows; then run FrequentDirections (in particular Al¬ 
gorithm |3.l| ) on B' to create sketch B with £ rows. Given that B\ and B 2 satisfy Facts 00 and [ 3 ] with 
parameters Ai and A 2 , respectively, we will show that B satisfies the same facts with A = A 1 + A 2 + 4, 
where 6 is taken from the single shrink operation used in Algorithm 3.1 This implies B automatically 
inherits the bounds in Theorem 1 1.1 1 and Theorem 1 1.21 as well. 

First note that B' satisfies all facts with A' = A 1 + A 9 , by additivity of squared spectral norm along any 
direction x (e.g. ||f?ix|| 2 + ||f? 2 x|| 2 = 1111 2 ) and squared Frobenious norms (e.g. ||f?i||| + ||f? 2 ||| = 
|If?'|||), but has space twice as large as desired. Property |T|is straight forward for B since B only shrinks in 
all directions in relation to B'. For Property [2]follows by considering any unit vector x and expanding 

||f?x|| 2 > ||H'x|| 2 — 6 > ||Ax|| 2 — (Ai + A 2 ) — 6 = ||Ax|| 2 — A. 

Similarly, Property [3] can be seen as 


5||| < Ill'll! - 6£ < ||A||| - (Ai + A 2 )£ -51 = 


I - A L 


This property trivially generalizes to any number of partitions of A. It is especially useful when the matrix 
(or data) is distributed across many machines. In this setting, each machine can independently compute a 
local sketch. These sketches can then be combined in an arbitrary order using FrequentDirections. 


3.2 Worst case update time 


The total running time of the Algorithm 3.1 is 0(nd £) and the amortized running time per row update 
is 0(d£). However, the worst case update time is still Q(d/ 2 ) in those cases where the svd is computed. 
Using the fact that FrequentDirections sketches are mergeable, we can actually use a simple trick to 
guarantee a worst case 0(d£) update time. The idea is to double the space usage (once again) and hold two 
sketches, one in ‘active’ mode and one in svd ‘maintenance’ mode. For any row in the input, we first add 
it to the active sketch and then spend 0(d£) floating point operations in completing the svd of the sketch 
in maintenance mode. After l updates, the active sketch runs out of space and must go into maintenance 
mode. But, in the same time, a total of 0(d£ 2 ) floating point operations were invested in the inactive sketch 
which completed its svd computation. At this point, we switch the sketch roles and continue. Once the 
entire matrix was processed, we combine the two sketches using their mergeable property. 




4 Space Lower Bounds 


In this section we show that FrequentDirections is space optimal with respect to the guaranteed ac¬ 
curacy. We present nearly-matching lower bounds for each case. We show the number of bits needed is 


equivalent to d times the number of rows FrequentDirections requires. We first prove Theorem 1.3 
showing the covariance error bound in Theorem|l.l|is nearly tight, regardless of streaming issues. 


Theorem 4.1. Let B be a £ X d matrix approximating a n x d matrix A such that |.4 7 ,4 — B T B 11 2 < 

11 A — A k | [ 2 f /(£ — k). For any algorithm with input as annxd matrix A, the space complexity of representing 
B is t l (d£) bits of space. 


Proof For intuition, consider the set of matrices A such that for all A G A we have A T A is an F/4 
dimensional projection matrix. For such matrices ||A r y4|| = 1 and ||A — A^\\ 2 F /{£ — k) = 1/4. The 
condition ||A T A — B T B H 2 <1/4 means that B T B gets “close to” A T A which should intuitively require 
roughly Q(d£) bits (which are also sufficient to represent A T A). 

To make this argument hold mathematically, consider a set of matrices A such that \\AjA t — A 1 - Af \ > 
1/2 for all Ai,Aj G A. Consider also that the sketching algorithm computes B which corresponds to 
A, G A. Since ||A T Gl — B r B H 2 <1/4 and || Aj Ai — AjAjj| >1/2 there could be only one index such 


that \\A[ Aj_ — B t B\\ < 1/4 by the triangle inequality. Therefore, since B indexes uniquely into A, it must 
be that encoding B requires at least log(|>l|) bits. To complete the proof we point out that there exists a set 
Afor which \\Aj Ai — Aj Aj\\ > 1 /2 for all A,, Aj G A and log(|.A|) = Q(d£). This is proven by Kapralov 
and Talwar [34] using a result of Absil, Edelman, and Koev [2]. In [34] Corollary 5.1, setting 5 = 1/2 and 
k = 1 /4 (assuming d > £) yields that |A| = 2and completes the proof. □ 


The consequence of Theorem |4.1| is that the space complexity of FrequentDirections is optimal 
regardless of streaming issues. In other words, any algorithm satisfying \\A T A— B T B \\-2 < \\A—A k \\ 2 F / {£— 
k ) must use space Ll(£d) since this is the information lower bound for representing B: or any algorithm 
satisfying \\A 7 A — B T B \\2 < e|| A — A k \\ 2 F must have £ = Fl(k + l/e) and hence use Q(d£) = Fl(dk + d/e) 
space. 

Next we will prove Theorem 


1.4 


showing the subspace bound \\A — tt^(A)\\ 2 f < (1 + e)\\A — A^\ 


2 
F 

preserved by Theorem 1.2 with £ = k + k/e rows, and hence 0(kd/e ) space, is also nearly tight in the 
row-update streaming setting. In fact it shows that finding B such that || A — /I IFB\\ p = || A — ttb(A)\\f < 
(l+e)|| A—Ak || f requires OJjlk/e) space in a row-update streaming setting. This requires careful invocation 
of more machinery from communication complexity common in streaming lower bounds, and hence we start 
with a simple and weaker lower bound, and then some intuition before providing the full construction. 


4.1 A Simple Lower Bound 

We start with a simple intuitive lemma showing an kl(kd) lower bound, which we will refer to. We then 
prove our main Fl(kd/e) lower bound. 

Lemma 4.1. 4/i y streaming algorithm which, for every input A, with constant probability (over its internal 
randomness) succeeds in outputting a matrix Rfor which \\A — AB)R\\p < (1 + e)|| A — must use 

kl(kd) bits of space. 

Proof. Let S be the set of /c-dimcnsional subspaces over the vector space GF{2) d , where GF(2) denotes 
the finite field of 2 elements with the usual modulo 2 arithmetic. The cardinality of S is known [411 to be 

(2 d — l)(2 d — !)••• (2 d ~ k+l — 1 ) dk/2 _ k * dk/e 

(2 k — 1 ) ( 2 fc — 1 — 1 ) • • • 1 - - ’ 

where the inequalities assume that k < d/3. 
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Now let A 1 and A 2 be two k x d matrices with entries in {0,1} whose rows span two different k- 
dimensional subspaces of GF(2) d . We first claim that the rows also span two different /c-dinicnsional 
subspaces of M rf . Indeed, consider a vector v £ GF(2) d which is in the span of the rows of A 1 but not in the 
span of the rows of A 2 . If A 1 and A 2 had the same row span over M rf , then v = Yli WiA 2 , where the w t € M 
and A 2 denotes the i-th row of A 2 . Since v has integer coordinates and the A 2 have integer coordinates, 
we can assume the Wi are rational, since the irrational parts must cancel. By scaling by the least common 
multiple of the denominators of the Wi, we obtain that 

a ' v = ^2,Pi A h (!) 

i 

where a, /3\,..., /3k are integers. We can assume that the greatest common divisor (gcd) of a, p],.... (A 
is 1, otherwise the same conclusion holds after we divide a, /3i,..., /3k by the gcd. Note that (jTJ) implies 
that av = Yli PiA 2 mod 2, i.e., when we take each of the coordinates modulo 2. Since the /3j cannot 
all be divisible by 2 (since a would then be odd and so by the gcd condition the left hand side would 
contain a vector with at least one odd coordinate, contradicting that the right hand side is a vector with even 
coordinates), and the rows of A 2 form a basis over GF(2 d ), the right hand side must be non-zero, which 
implies that a = 1 mod 2. This implies that v is in the span of the rows of A 2 over GF(2 d ), a contradiction. 

It follows that there are at least 2 dl '' ,i(> distinct dimensional subspaces of W 1 spanned by the rows of the 
set of binary k x d matrices A. For each such A, ||A — Ak\\p = 0 and so the row span of R must agree with 
the row span of A if the streaming algorithm succeeds. It follows that the output of the streaming algorithm 
can be used to encode log 2 2' lf ' :/( ‘ = kl(dk) bits of information. Indeed, if A is chosen at random from this 
set of at least binary matrices, and Z is a bit indicating if the streaming algorithm succeeds, then 


\R\ > H{R) > I(R;A\Z) > (2/3)I(R-,A\ Z = 1) > (2/3)(dfc/6) = f 1(dk), 


where |i?| denotes the expected length of the encoding of R, H is the entropy function, and I is the mutual 
information. For background on information theory, see 151. This completes the proof. □ 


4.2 Intuition for Main Lower Bound 


The only other lower bounds for streaming algorithms for low rank approximation that we know of are due 


to Clarkson and Woodruff [ 13 ]. As in their work, we use the Index problem in communication complexity to 
establish our bounds, which is a communication game between two players Alice and Bob, holding a string 
x 6 {0, l} r and an index i e [r] =: {1, 2,..., r}, respectively. In this game Alice sends a single message to 


Bob who should output x,- with constant probability. It is known (see, e.g., [361) that this problem requires 


Alice’s message to be £l(r) bits long. If Alg is a streaming algorithm for low rank approximation, and Alice 
can create a matrix A x while Bob can create a matrix Bi (depending on their respective inputs x and i), then 
if from the output of Alg on the concatenated matrix [A x \ Bi] Bob can output x, with constant probability, 
then the memory required of Alg is Q(r) bits, since Alice’s message is the state of Alg after running it on 

A x - 

The main technical challenges are thus in showing how to choose A x and B ,, as well as showing how the 
output of Alg on [A x ; Bi] can be used to solve Index. This is where our work departs significantly from that 


of Clarkson and Woodruff [13). Indeed, a major challenge is that in Theorem 1 1 ,4[ we only require the output 
to be the matrix R, whereas in Clarkson and Woodruff’s work from the output one can reconstruct AR)R. 
This causes technical complications, since there is much less information in the output of the algorithm to 
use to solve the communication game. 

is that given a 2 x d matrix A = [1, x\ 1, 0 d ], where x is a 


The intuition behind the proof of Theorem 


1.4 


random unit vector, then if P = R)R is a sufficiently good projection matrix for the low rank approximation 
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problem on A, then the second row of AP actually reveals a lot of information about x. This may be 
counterintuitive at first, since one may think that [1, O' 1 : 1, ()' / is a perfectly good low rank approximation. 
However, it turns out that [l,x/2;l,x/2]isa much better low rank approximation in Frobenius norm, and 
even this is not optimal. Therefore, Bob, who has [1, O' 1 ] together with the output P, can compute the second 
row of AP, which necessarily reveals a lot of information about x (e.g., if AP ss [1, x/2; 1, x/2], its second 
row would reveal a lot of information about x), and therefore one could hope to embed an instance of the 
Index problem into x. Most of the technical work is about reducing the general problem to this 2 x d 
primitive problem. 


4.3 Proof of Main Lower Bound for Preserving Subspaces 

Now let c > 0 be a small constant to be determined. We consider the following two player problem 
between Alice and Bob: Alice has a ck/e x d matrix A which can be written as a block matrix [I,R], 
where I is the ck/e x ck/e identity matrix, and R is a ck/e x (d — ck/e) matrix in which the entries 
are in {—1 /{d — ck/e) 1 / 2 , +l/(d — ck/e) 1 / 2 }. Here [I, A] means we append the columns of I to the 
left of the columns of A; see Figure [T| Bob is given a set of k standard unit vectors e tl ,, Cj k , for 
distinct i\,... E [ck/e] = {1,2,..., ck/e}. Here we need c/e > 1, but we can assume e is less than 
a sufficiently small constant, as otherwise we would just need to prove an Q(kd) lower bound, which is 
established by Lemma [4~Tj 

e ik . The goal 


Let B be the matrix [A; e 




obtained by stacking A on top of the vectors e, 


U 9 ••• 1 


is for Bob to output a rank-A; projection matrix P E M. dxd for which \\B — BP \\ F <(l+e)\\B-B k \\ F . 

Denote this problem by /. We will show the randomized 1-way communication complexity of this prob¬ 
lem R 1 1 J™ ay (f), in which Alice sends a single message to Bob and Bob fails with probability at most 1/4, 
is Fl(kd/e) bits. More precisely, let // be the following product distribution on Alice and Bob’s inputs: the 
entries of R are chosen independently and uniformly at random in {—1 /{d — ck/e) 1 / 2 , +1 /(d — ck/e) 1 / 2 }, 
while {fi,..., if .} is a uniformly random set among all sets of k distinct indices in [ck/e]. We will show 
that D^~/J^ y (f) = il(kd/e), where /V /J/ 11 )/) denotes the minimum communication cost over all deter¬ 
ministic 1-way (from Alice to Bob) protocols which fail with probability at most 1/4 when the inputs are 
distributed according to p. By Yao’s minimax principle (see, e.g., [371), R 1 J]/ iay (f) > D l ^/J/ y (f). 

We use the following two-player problem Index in order to lower bound L/ 1 t -7(7). In this problem 
Alice is given a string x E {0, l} r , while Bob is given an index i E [r]. Alice sends a single message to 
Bob, who needs to output Xi with probability at least 2/3. Again by Yao’s minimax principle, we have that 


R 


1 —way / 


l—way 


1/3 ( lndeX ) > D u,l /3 


(Index), where v is the distribution for which x and i are chosen independently 


and uniformly at random from their respective domains. The following is well-known. 


Fact 4.1. [36] D 1 770 ndex ) = ^(0- 
fheorem 4.2. For c a small enough pos\ 


Proof. We will reduce from the Index problem with r = ( ck/e)(d — ck/e). Alice, given her string x to 
Index, creates the ck/e X d matrix A = [I, R] as follows. The matrix I is the ck/e x ck/e identity matrix, 
while the matrix A is a ck/e x (d — ck/e) matrix with entries in {— l/(d — ck/e) 1 / 2 , +l/(d — ck/e) 1 / 2 }. 
For an arbitrary bijection between the coordinates of x and the entries of A, Alice sets a given entry in A 
to — l/(d — ck/e) 1 / 2 if the corresponding coordinate of x is 0, otherwise Alice sets the given entry in A 
to +1 /{d — ck/e) 1 / 2 . In the Index problem, Bob is given an index, which under the bijection between 
coordinates of x and entries of A, corresponds to being given a row index i and an entry j in the z-th row 
of A that he needs to recover. He sets ii = i for a random t E [k], and chooses k — I distinct and random 
indices ij E [ck/e] \ {ip}, for j E [k] \ {£}. Observe that if (x, i) ~ v, then (A, ii,...,if.) ~ p. Suppose 
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d — ck/e 


ck/e 


HF 



Rs 


Alice 


Rr 


T 

Bob 

1 


Figure 1: Illustration of matrix B used for lower bound construction. 


there is a protocol in which Alice sends a single message to Bob who solves / with probability at least 3/4 
under p. We show that this can be used to solve Index with probability at least 2/3 under v. The theorem 
will follow by Fact |4.1| Consider the matrix B which is the matrix A stacked on top of the rows ey,..... e ifc , 
in that order, so that B has ck/e + k rows. 

We proceed to lower bound \\B — BP\\' 2 F in a certain way, which will allow our reduction to Index to be 
earned out. We need the following fact: 

Fact 4.2. ((2.4) of ^5 Q]j ) Let A be an m x n matrix with i.i.d. entries which are each +1 / y/n with probability 
1/2 and — 1/y/n with probability 1/2, and suppose mjn < 1. Then for all t > 0, 


Pr[\\A\\ 2 > 1 +t + yf///n\ < a e -“' nf3/2 . 

where a, a 7 > 0 are absolute constants. Here || A\/ is the operator norm sup r of A. 


We apply Fact 4.2 to the matrix R, which implies, 

Pr[||i*|| 2 > 1 + V~c+s/{ck/e)/{d-{ck/e))} < ae -^(d-{ck/e))^^ 


and using that d> k/e and c > 0 is a sufficiently small constant, this implies 

Pr[||i?|| 2 > 1 + 3Vc] < e~P d , (2) 

where (3 > 0 is an absolute constant (depending on c). Note that for c > 0 sufficiently small, (1 + 3 y/c) 2 < 
1 + 7 yfc. Let £ be the event that ||7?||| < 1 + 7 y/c, which we condition on. 

We partition the rows of B into If and B 2 , where If contains those rows whose projection onto the first 
ck/e coordinates equals a for some i ^ {ii,..., ik}. Note that B\ is ( ck/e — k) x d and B 2 is 2k x d. 
Here, B 2 is 2 k x d since it includes the rows in A indexed by ii, ..., 'f, together with the rows ,..., e\ k . 
Let us also partition the rows of R into Ilr and Rg, so that the union of the rows in Ilr and in Rg is equal 
to R, where the rows of Hr are the rows of II in If. and the rows of Rg are the non-zero rows of R in B 2 
(note that k of the rows are non-zero and k are zero in B 2 restricted to the columns in R). 
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Lemma 4.2. For any unit vector u, write u = ur + us + ut, where S = {+,..., T = [ck/e\ \ S, 
and R = [d] \ \ck/e], and where UAfor a set A is 0 on indices j ^ A. Then, conditioned on £ occurring, 
\\Bu\\ 2 < (1 + 7-y/c)(2 - ||ttr|| 2 - \\ur\\ 2 + 2|| u s + wt||||wr||). 

Proof. Let C be the matrix consisting of the top ck/e raws of B, so that C has the form [I, R], where I is a 
ck/exck/e identity matrix. By construction of B, ||i?rt|| 2 = ||+s|| 2 + ||CTt|| 2 - Now, Cu = us + ut + Rur, 
and so 


|| C7?z|| 2 = \\us + ut\V + ||7?u/j||" + 2{u s + ut) T Rur 

< 11 rtg + ut\\ 2 + (1 + 7v / c)||«r?J| 2 + 2||us + ut\\\\Rur}\ 

< (1 + 7v / c)(||ns|| 2 + UnTlI 2 + |Kts|| 2 ) + (1 + 3\/c)2|| us + ut\\\\ur 

< (1 + 7v / c)(l + 2||ns + ut\\ H^rII), 

and so 

||-Bn|| 2 < (1 + 7 v / c)(l + ||ns|| 2 + 2\\us + wt||||«h||) 

= (1+ 7^c)(2 — \\ur\\ 2 — \\ut\\ 2 + 2\\us+ Ut\\\\ur\\). □ 


We will also make use of the following simple but tedious fact. 

Fact 4.3. For x £ [0,1], the function f(x) = 2xy/l — x 2 — x 2 is maximized when x = yJ\/2— \/5/10. 
We define C, to be the value of f(x) at its maximum, where (, = 2/sfb + \/5/10 — 1/2 ~ .618. 


Proof. Setting y 2 = 1 — x 2 , we can equivalently maximize f(y ') = — 1 + 2 yy/l — y 2 + y 2 , or equivalently 
g(y) = 2 yyj 1 — y 2 + y 1 . Differentiating this expression and equating to 0, we have 

_ 2y 2 

2^1 -y 2 ~ + 2r/ = 0. 

V 1 -v 


Multiplying both sides by y/l — y 2 one obtains the equation 4;// 2 — 2 = 2yy/l — y 2 , and squaring both 
sides, after some algebra one obtains 5y 4 — 5 y 2 + 1 = 0. Using the quadratic formula, we get that the 
maximizer satisfies y 2 = 1/2 + y/5/10, or x 2 = 1/2 — \/5/10. □ 

Corollary 4.1. Conditioned on £ occurring, \\B\\ 2 < (1 + 7y / c)(2 + Q. 


Proof. By Lemma 4.2 for any unit vector u, 


||-Bu|| 2 < (1 + 7\/c)(2 - HutII 2 — ||«r|| 2 + 21| us + ut||||+r||)- 


Suppose we replace the vector us + ut with an arbitrary vector supported on coordinates in S with the same 
norm as us + ut- Then the right hand side of this expression cannot increase, which means it is maximized 
when ||ut|| = 0, for which it equals (1 + 7y/c){2 — H'UrII 2 + 2^/1 — ||it_R|| 2 ||'UR||), and setting ||wr| 


equal the x in Fact 4.3 we see that this expression is at most (1 + 7 t/c)(2 + (). 


to 

□ 


Write the projection matrix P output by the streaming algo rithm as UU 1 , w here U is d x k with or¬ 
thonormal columns u l (note that R)R = P). Applying Lemma 4.2 and Fact 4.3 to each of the columns u l , 
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we have: 


BPf F 


< 


< 


k 

\\BU\\ 2 F = J2\\Bu i f 
2 = 1 
k 

(1 + 7y/c) ^(2 - II^tII 2 “ \\ u r\\ 2 + 2IIMs + U l T \\ ||Ufl||) 

2—1 

k 

a+ i\fc){2k - x;(ii«5-ii 2 )+E< 2 ii4+“Tinian - ii«kii 2 )) 
2—1 2—1 

(1 + 7\/c)(2fc - ]T(K|| 2 ) + ^(2^1-KIPKII - ll<4f)) 
2—1 2—1 

k 

(1 + 7\/c)(2k - y;(KH 2 ) + k() 

2—1 

k 

(1 + 7\/c)((2 + C)k - ^ ll^rll 2 )- 
2—1 


(3) 


Using the matrix Pythagorean theorem, we thus have, 

\\b-bp\\ 2 f = \\b\\ 2 f - \\bp\\ 2 f 

k 

> 2 ck/e + k — (1 + 7a/c)((2 + ()k — ^ ||«tI| 2 ) using \\B\\ F = 2 ck/e + k 

2=1 

k 

> 2 ck/e + k — (1 + 7y / c)(2 + ()k + (1 + 7\/c) y] ||rtr|| 2 - (4) 

2=1 

We now argue that |_B — cannot be too large if Alice and Bob succeed in solving /. First, we need 

to upper bound \\B — Bk\\ 2 F . To do so, we create a matrix B^ of rank-/,: and bound ||B — Bk\\ 2 F - Matrix 
/)/,. will be 0 on the rows in B\. We can group the rows of Ik) into k pairs so that each pair has the form 
ei + v l ,ei, where i G [ck/e] and v l is a unit vector supported on [d] \ [ck/e]. We let Y, be the optimal (in 
Frobenius norm) rank-1 approximation to the matrix [e* + v l \ e,]. By direct computation [j] the maximum 
squared singular value of this matrix is 2 + £. Our matrix /!/,. then consists of a single Y, for each pair in B 2 . 
Observe that /)/, has rank at most k and 

11 B — <||B — Bk\\ F < 2ck/e k — (2 + £)/c, 


Therefore, if Bob succeeds in solving / on input B, then, 

\\B — BP\\ 2 f < (1 + e){2ck/e + k — (2 + £)k) < 2ck/e + k — (2 + Qk + 2ck. (5) 

Comparing (|4j) and (|5]), we arrive at, conditioned on £: 

k 

ll^rll 2 < i , 7 r ' + C)k + 2ck) < c\k , (6) 

U 1 + 7 ^ 

where c\ > 0 is a constant that can be made arbitrarily small by making c > 0 arbitrarily small. 

4 For an online SVD calculator, see http : //www. bluebit. gr/matrix-calculator/ 
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Since P is a projector, ||.BP||r = ||.B£/j|R. Write U = U + U, where the vectors in U are supported on 
T, and the vectors in U are supported on \d\ \ T. We have, 


\\BUf F < \\BWldk < (1 + 7y/c)(2 + Qak < c 2 k, 


where the first inequality uses ||5i7||i? < ||.B||2||?7 ||f and the second inequality uses that events occurs, 
and the third inequality holds for a constant C 2 > 0 that can be made arbitrarily small by making the constant 
c > 0 arbitrarily small. 

Combining with ([5]) and using the triangle inequality, 


> 

\\bp\\ f -\\bu\\ f 

using the triangle inequality 

> 

\\BP R — y/ c 2 k 

using our bound on \\BU ||r 

= 

yJ\\Bf F -\\B-BP\\ 2 F -yfek 

by the matrix Pythagorean theorem 

> 

y/{2 + ()k - 2 ck - y/ c 2 k 

by 0 

> 

y/ (2 + Qk — c%k, 



(V) 


where C 3 > 0 is a constant that can be made arbitrarily small for c > 0 an arbitrarily small constant (note that 
C 2 > 0 also becomes arbitrarily small as c > 0 becomes arbitrarily small). Hence, \\B[J\\‘ 2 F > (2+Qk— 03 k, 
and together with Corollary |4.l[ that implies \\U\\ 2 F > k — c±k for a constant C 4 that can be made arbitrarily 
small by making c > 0 arbitrarily small. 

Our next goal is to show that 11-S 2 ^ 11 ^ i s almost as large as ||f?[7|||,. Consider any column u of U, and 
write it as us + ur- Hence, 

\\Bu \\ 2 = \\R t u r \\ 2 + \\B 2 u\\ 2 using B\u = R t ur 

< 11 Rtur 11 2 + 11 iis + Rs^r 11 2 + 11 'MS 11 2 by definition of the components 

= \\Rur\\ 2 + 2\\us\\ 2 + ^iLgRsUR using the Pythagorean theorem 

< 1 + 7y/c+ ||%|| 2 + 2||?2s||||.Rs'u r || 

using ||Rur || 2 < (1 + 7yfc)\\u R \\ 2 and ||ur || 2 + ||n 5|| 2 < 1 
(also using Cauchy-Schwarz to bound the other term). 

Suppose H-Rs^rII = t||ur|| for a value 0 < r < 1 + 7 yfc. Then 

\\Bu \\ 2 < 1 + 7 y/c + H'usll 2 + 2r||ws||v / l - Ills'll 2 - 

We thus have, 

||Ru|| 2 < 1 + 7Vc+ (1 - r)||ti 5 || 2 + r(||n s || 2 + 2\\u s \\y/l - ||us|| 2 ) 

< 1 + 7y/c+ (1 - r) + r(l + C) by Fact 

< 2 + rC + 7 Vc, ( 8 ) 

and hence, letting T\,... .t^ denote the coiTesponding values of r for the k columns of U, we have 

k 

\\BU\\ 2 f < (2 + 7V~c)k + (J2 T i- (9) 

i= 1 

Comparing the square of ([V]) with (|9]), we have 

k 

y 'n > k-c 5 k, (10) 

i=l 


4.3 
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where C 5 > 0 is a constant that can be made arbitrarily small by making c > 0 an arbitrarily small constant. 
Now, 11O’|> k — c^k as shown above, while since \\R s ur\\ = Tj||rt^|| if vlr is the i-th column of U, by 
© we have 

\\RsUr\\f > (1 — c 6 )k (11) 

for a constant eg that can be made arbitrarily small by making c > 0 an arbitarily small constant. 

Now ||.ROr||!i < (1 + 7 y/c)k since event £ occurs, and H-RC/rHI* = \\RtUr\\ 2 f + \\RsUr\\ 2 f since the 
rows of R are the concatenation of rows of R$ and Rr, so combining with ( [IT] ), we arrive at 

\\RtUr\\ 2 f < C7 k, ( 12 ) 

for a constant C 7 > 0 that can be made arbitrarily small by making c > 0 arbitrarily small. 

Combining the square of 0 with ( |~i~2| ). we thus have 

\\B 2 U\\f = \\BU\\l-\\B 1 U\\ 2 F = \\BU\\ 2 F -\\R T U R \\ 2 F >(2 + C)k-c 3 k-c 7 k 

> (2 + ()k-c 8 k, (13) 

where the constant c 8 > 0 can be made arbitrarily small by making c > 0 arbitrarily small. 

By the triangle inequality, 

\\B 2 U\\ f > \\B 2 U\\ f - \\B 2 U\\ f > ((2 + C )k - c 8 kf / 2 - (c 2 fc) 1/2 - (14) 


Hence, 

\\B 2 - B 2 P\\ f = \J\\B 2 \\ 2 f - \\B 2 U\\ 2 f by matrix Pythagorean, \\B 2 U\\ F = \\B 2 P\\ f 

< \/\\B 2 \\ f - (\\B 2 U\\ F - \\B 2 U \\ f ) 2 by triangle inequality 

< \J%k- (((2 + Q)k - cgk ) 1 / 2 - {c-i.k ) 1 / 2 ) 2 using <[T4|), \\B 2 \\ F = 3fc,(15) 

or equivalently, || B 2 - B 2 P\\ 2 F <3 k - ((2 + ()k - c 8 k) - ( c 2 k ) + 2fe(((2 + () - c 8 )c 2 ) 1//2 < (1 — 
()k + c 8 k + 2fc(((2 + () — c 8 )c 2 ) 1,/2 < (1 — ()k + cgk for a constant eg > 0 that can be made arbitrarily 
small by making the constant c > 0 small enough. This intuitively says that P provides a good low rank 
approximation for the matrix B 2 . Notice that by ( [15] ), 

\\B 2 P\\ 2 f = \\B 2 \\ 2 F -\\B 2 -B 2 P\\ 2 F >3k-(l-C)k-cgk>(2 + C)k-c 9 k. (16) 

Now B 2 is a 2k x d matrix and we can partition its rows into k pairs of rows of the form Z( = (e lf + 
R , lf , ei t ), for £ = 1,..., k. Here we abuse notation and think of R lf as a d-dimensional vector, its first 
ck/e coordinates set to 0. Each such pair of rows is a rank-2 matrix, which we abuse notation and call Zj. 
By direct computation [^] Zj has squared maximum singular value 2 + (\ We would like to argue that the 
projection of P onto the row span of most Zf has length very close to 1. To this end, for each Zf consider the 
orthonormal basis Vf of right singular vectors for its row space (which is spanfe,;,,. R 1t )). We let vj x . vj 2 
be these two right singular vectors with corresponding singular values o\ and a 2 (which will be the same 
for all l, see below). We are interested in the quantity A = P / / / '-^1l r which intuitively measures how 

much of P gets projected onto the row spaces of the Zj. 

Lemma 4.3. Conditioned on event £, A £ [k — c \ 0 A:, k + c 10 k], where cio > 0 is a constant that can be 
made arbitrarily small by making c > 0 arbitrarily small. 

3 We again used the calculator at http: / /www . bluebit. gr/matrix-calculator / 
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Proof. For any unit vector u, consider Y^t=\ ||l/ T ri|| 2 . This is equal to ||« 5 || 2 + ||-R,s''Ur|1 2 - Conditioned on 


v i “I 12 

£, |ji? 5 U/jJ| 2 < (1 + 7 v / c)||ttH|| 2 . Hence, J2e=i l!^> T «|| 2 < 1 + 7 yfc, and consequently, A < k( 1 + 7-y / c)- 
On the other hand, ||_E? 2 .P||p = Y^e=\ II ZjP\\ 2 F . Since \\Zj ||| < 2 + </, it follows by (16) that A > 
k — (cg/(2 + ())k, as otherwise A would be too small in order for ( fl6] ) to hold. 

The lemma now follows since yfc. and C 9 can be made arbitrarily small by making the constant c > 0 
small enough. □ 

We have the following corollary. 

Corollary 4.2. Conditioned on event £, for a 1 — y/cg + 2c 10 fraction ofi E [k], \ \f ’ P11 j r < 1 + c\\, and 
for a 99/100 fraction of l E [A:], we have ||V^ T P||^ > 1 — cn, where c\\ > 0 is a constant that can be made 
arbitrarily small by making the constant c > 0 arbitrarily small. 

Proof. For the first part of the corollary, observe that 

PJP\\l = a?||^iP|| 2 + a 2 2 ||^ 2 P|| 2 , 

where vj 1 and vj 2 are right singular vectors of V/\ and o±, 02 are its singular values, with of = 2 + £ and 


o 2 = 1 — C- Since A < k + cioA by Lemma 4.3 we have 

k 


W V h P W 2 + \\ v t,2 P \\ 2 ^ k + C 10 fc - 


e=i 


If Ya =1 \\ v e, 2 P \\ 2 - ( c 9 + 2cto )k, then 

\\B 2 PWl < E H^ Tp l 


12 

If 


< (2 + ()(k + ci 0 A - 2ci 0 A - cgk) + (1 - ()( 2 ci 0 k + cgk) 

< (2 + Q(k - cgk) - (2 + C)ci 0 A: + (1 - C)(2ci 0 A + cgk ) 

< (2 + Qk — 2 cgk — (,cg k — 2 c\gk — (c\gk + 2cioA + cgk — 2 (cigk — (cgk 

< (2 + Qk - (1 + 2()c g k + -3Cciofc 

< (2 + ()k — cgk 

which is a contradiction to (16). Hence, Yle=i I^^T’ll 2 < (eg + 2cio)A. This means by a Markov bound 

that a 1 — y/cg + 2cio fraction of t satisfy ||n^ 2 P|[ 2 < y/cg + 2cio, which implies that for this fraction that 

\WfP\\ 2 p < 1 + y/cg + 2cio- 

For the second part of the corollary, suppose at most 99A;/100 different i satisfy | Vf P \ \ j r > 1 — 
200 y/eg + 2cio- By the previous part of the corollary, at most y/cg + 2 c\ok of these i can satisfy 11 VjP \ \ 

1 + y/cg + 2cio- Hence, since 


> 


WfPf F < 2, 


< 2yJ eg + 2cioA + (1 + y/ eg + 2cio)(99/100 — y/ eg + 2cio)A: + (1 — 200\/ eg + 2cio)A/100 

< 2y/cg~+2cfgk + 99A/100 + 99ky/cg + 2cio/100 — ky/cg + 2cio + A/100 — 2y/ Cg + 2cioA 

< k — y/ Cg + 2ciofc/100 

< k — \/2ciofc/100 

< k — Cigk, 


where the final inequality follows for cio > 0 a sufficiently small constant. This is a contradiction to 
Hence, at least 99 k /100 different i satisfy 


Lemma 


4.3 


VjP \\f > 1 — 200a/c 9 + 2cio- Letting cn = 


200 y/af+ 2 cio, we see that cn can be made an arbitrarily small constant by making the constant c > 0 
arbitrarily small. This completes the proof. □ 


17 

























Recall that Bob holds i = ig for a random i E [k]. It follows (conditioned on £) by a union bound that 
with probability at least 49/50, ||b£ T -P||p E [1 — cn, 1 + cn], which we call the event T and condition 
on. We also condition on event Q that \\Zj P\\ 2 F > (2 + (j — c 12 , for a constant C 12 > 0 that can be made 
arbitrarily small by making c > 0 an arbitrarily small constant. Combining the first part of Corollary |4.2| 
together with ( p~6] ), event Q holds with probability at least 99.5/100, provided c > 0 is a sufficiently small 
constant. By a union bound it follows that £, F, and Q occur simultaneously with probability at least 49/51. 

As \\ZJP\\ 2 f = af\\vf 1 P \\ 2 + a 2 \\vJ 2 P\\ 2 , with af = 2 + ( and cr 2 = 1 — £, events £,F, and Q imply 
that IlnJ/PH 2 > 1 — C 13 , where C 13 > 0 is a constant that can be made arbitrarily small by making the 
constant c > 0 arbitrarily small. Observe that \\vj x P\\ 2 = (fy.i, z) 2 , where z is a unit vector in the direction 
of the projection of vg \ onto P. 

By the Pythagorean theorem, \\v^\ — {v£i, z)z\\ 2 = 1 — (v£\, z ) 2 , and so 

IKi - (^,i,z)z|| 2 < c 14 , (17) 

for a constant C 14 > 0 that can be made arbitrarily small by making c > 0 arbitrarily small. 

We thus have ZjP = a\{v£ t \, z)u£^z T + 02(^21 w)u£^w T , where w is a unit vector in the direction of 
the projection of of V £ t 2 onto P, and U£ t \, 1^2 are the left singular vectors of Zj. Since T occurs, we have 
that | (v£ t 2 , w) | < cn, where cn > 0 is a constant that can be made arbitrarily small by making the constant 
c > 0 arbitrarily small. It follows now by ( fF7j ) that 

|| ZjP- (J\U£ ) iv\ 1 \\ 2 f < C15 , (18) 

where C 15 > 0 is a constant that can be made arbitrarily small by making the constant c > 0 arbitrarily 
small. 

By direct calculatiorj^J U£ t i = —.851 — .526 Ri e and v^i = —.851 — .526 Ri r It follows that 

|| ZjP — (2 + C)[-724 a t + A48Ri e ; .448 ei e + .277Ri e \\\ F < C 15 . Since e lf is the second row of Zj, it 
follows that Ije^P - (2 + C)(-448e^ + .277R it )\\ 2 < c 15 . 

Observe that Bob has e\ t and P, and can therefore compute e[ ( P. Moreover, as C 15 > 0 can be made 
arbitrarily small by making the constant c > 0 arbitrarily small, it follows that a 1 — ci 6 fraction of the signs 
of coordinates of e[ ( P, restricted to coordinates in [d] \ [ck/e\, must agree with those of (2 + Q).277R lf , 
which in turn agree with those of Ri v Here ci 6 > 0 is a constant that can be made arbitrarily small by 
making the constant c > 0 arbitrarily small. Hence, in particular, the sign of the j-th coordinate of R lt , 
which Bob needs to output, agrees with that of the j-th coordinate of ef e P with probability at least 1 — c\q. 
Call this event 7~i. 

By a union bound over the occurrence of events £. P, Q, and 7~L, and the streaming algorithm succeeding 
(which occurs with probability 3/4), it follows that Bob succeeds in solving Index with probability at least 
49/51 — 1/4 — ci 6 > 2/3, as required. This completes the proof. □ 


5 Related Work on Matrix Sketching and Streaming 

As mentioned in the introduction, there are a variety of techniques to sketch a matrix. There are also several 
ways to measure error and models to consider for streaming. 

In this paper we focus on the row-update streaming model where each stream elements appends a row to 
the input matrix A. A more general model the entry-update model fixes the size of A at n X d, but each 
element indicates a single matrix entry and adds (or subtracts) to its value. 

'’Using the online calculator in earlier footnotes. 
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Error measures. The accuracy of a sketch matrix B can be measured in several ways. Most commonly 
one considers an n x d, rank k matrix A that is derived from B (and sometimes A) and measures the 
projection error where proj-err = ||A — ,411 ^ / 11,4 — ,4/,. || 2 ,-. When A can be derived entirely from B, we 
call this a construction result, and clearly requires at least f l(n + d) space. When the space is required to 
be independent of to, then A either implicitly depends on A, or requires another pass of the data. It can 
then be defined one of two ways: A = 7r g(A) (as is considered in this paper) takes B^, the best rank-A; 
approximation to B, and then projects A onto /i/.; A = 11/.(44) projects A onto B, and then takes the 
best rank-A: approximation of the result. Note that tt%(A) is better than 11^(A), since it knows the rank-A; 
subspace to project onto without re-examining A. 

We also consider covariance error where covar-err = ||A T A — B T B\\ 2 /\\A\\ 2 F in this paper. One can 
also bound || — B T B\\ 2 /\\A — Afc|||., but this has an extra parameter k, and is less clean. This measure 

captures the norm of A along all directions (where as non-construction, projection error only indicates how 
accurate the choice of subspace is), but still does not require Q(n) space. 


Sketch paradigms Given these models, there are several types of matrix sketches. We describe them here 
with a bit more specificity than in the Introduction, with particular attention to those that can operate in the 
row-update model we focus on. Specific exemplars are described which are used in out empirical study to 
follow. The first approach is to sparsify the matrix ||4j[7][23|, by retaining a small number of non-zero. These 
algorithms typically assume to know the n X d dimensions of A, and are thus not directly applicable in out 
model. 

Random-projection: The second approach randomly combines rows of the matrix 140 47 5T]52|. For an 
efficient valiant |3 j we will consider where the sketch B is equivalent to RA where R is an £ x n matrix 
such that each element Rij e {—1 /\/T} uniformly. This is easily computed in a streaming fashion, 
while requiring at most 0(£d) space and (){ld) operation per row updated. Sparser constructions of random 
projection matrices are known to exist j fl6] , [33| . These, however, were not implemented since the running 
time of applying random projection matrices is not the focus of this experiment. 

Hashing: A variant of this approach [ 141 uses an extra sign-hash function to replicate the count-sketch 1121 
with matrix rows (analogously to how FrequentDirections does with the MG sketch). Specifically, 
the sketch B is initialized as the £ X d all zeros matrix, then each row a,; of A is added to row li(i) as 
B h ^ <— />/,(,;) + s(i)ai, where h : [n] — > [£] and s : [to] <— {—1,1} are perfect hash functions. There is no 
harm in assuming such functions exist since complete randomness is naively possible without dominating 
either space or running time. This method is often used in practice by the machine learning community and 
is referred to as “feature hashing” [55). Surprising new analysis of this method fT4| shows this approach is 
optimal for construction bounds and takes 0(nnz(A)) processing time for any matrix A. 

Sampling: The third sketching approach is to find a small subset of matrix rows (and/or columns) that 
approximate the entire matrix. This problem is known as the ‘Column Subset Selection Problem' and has 
been thoroughly investigated [9j 10} 18] 19] 22] 251. Recent results offer algorithms with almost matching 
lower bounds |9j[T3] 181. A simple streaming solution to the ‘Column Subset Selection Problem’ is obtained 
by sampling rows from the input matrix with probability proportional to their squared £2 norm. Specifically, 
each row Bj takes the value Ai/y/Ipi iid with probability pi = ||Aj|| 2 /||A|||,. The space it requires is 0(£d) 
in the worst case but it can be much lower if the chosen rows are sparse. Since the value of ||A||i? is not a 
priori known, the streaming algorithm is implemented by £ independent reservoir samplers, each sampling 
a single row according to the distribution. The update running time is therefore 0(d) per row in A. Despite 
this algorithm's apparent simplicity, providing tight bounds for its error performance required over a decade 
of research |6}[19[^|25]j46j49[ 53j. Advanced such algorithms utilize the leverage scores of the rows 1211 
and not their squared £2 norms. The discussion on matrix leverage scores goes beyond the scope of this 
paper, see | [22| for more information and references. 

There are a couple of other sketching techniques that try to maintain some form of truncated SVD as the 
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data is arriving. Incremental SVD and its variants [11 30 32,38 48|, are quite similar to FrequentDi 


RECTIONS, but do not perform the shrinkage step; with each new row they recompute the truncated SVD, 
simply deleting the smallest singular value/vector. No worst case error bounds can be shown for this ap¬ 
proach, and a recent paper | j26| demonstrates how it may fail dramatically; this paper |26| (which appeared 
after the initial conference versions of this paper) also shows how to approach or surpass the performance 
of Incremental SVD approaches using variants of the FrequentDirections approach described herein. 
Another approach by Feldman et al. 1241 decomposes the stream into geometrically increasing subparts, 
and maintains a truncated SVD for each. Recombining these in a careful process prevents the error from 
accumulating too much, and takes 0 (n • poly (d,£, logn)) time. 


Catalog of Related Bounds. We have summarized bounds of row-update streaming in Table[l] The space 
and time bounds are given in terms of n (the number of rows), d (the number of columns), k (the specified 
rank to approximate), r (the rank of input matrix A), e (an error parameter), and 6 (the probability of failure 
of a randomized algorithm). An expresion ()(x) hides polylog(rc) terms. The size is sometimes measured 
in terms of the number of rows (#R). Otherwise, if #R is not specified the space refers the number of words 
in the RAM model where it is assumed 0(log nd) bits fit in a single word. 

Recall that the error can be measured in several ways. 

• A construction result is denoted C. 

• If it is not constructive, it is denoted by a P for projection if it uses a projection A = tt b (A). Alterna¬ 
tively if the weaker form of A = IT B (A), then this is denoted P r where r is the rank of B before the 
projection. 

• In most recent results a projection error of 1 + e is obtained, and is denoted eR. 

• Although in some cases a weaker additive error of the form ||A — A\\^ < ||A — + e||A|||, is 

achieved and denoted eA. 

This can sometimes also be expressed as a spectral norm of the form || A—A\\ 2 < ll 7 ^ - ^fcll2+ e ll^llF 
(note the error term e|| A\\ 2 F still has a Frobenius norm). This is denoted el_ 2 . 

• In a few cases the error does not follow these patterns and we specially denote it. 

• Algorithms are randomized unless it is specified. In all tables we state bounds for a constant probabil¬ 
ity of failure. If we want to decrease the probability of failure to some parameter 6 , we can generally 
increase the size and runtime by 0(log(l/5)). 


It is worth noting that under the construction model, and allowing streaming turnstile updates to each 
element of the matrix, the hashing algorithm has been shown space optimal by Clarkson and Woodruff 1131 
(assuming each matrix entry requires 0(log nd) bits, and otherwise off by only a 0(log nd) factor). It 
is randomized and it constructs a decomposition of a rank k matrix A that satisfies \\A — A\\p < (1 + 
e)||A — Af : \\p, with probability at least 1 — 5. This provides a relative error construction bound of size 
0((k/e)(n + d ) log(nd)) bits. They also show an D((fe/e)(n + d)) bits lower bound. Our paper shows that 
the row-update model is strictly easier with a lower upper bound. 

Although not explicitly described in their paper 1131, one can directly use their techniques and analysis 
to achieve a weak form of a non-construction projection bound. One maintains a matrix B = ,4,3’ with 
m = 0((k/e) log(l/5)) columns where S is a d x m matrix where each entry is chosen from {—1, +1} at 
random. Then setting A = vr b{A), achieves a proj-err = 1 + e, however B is rank 0{(k/e) log( 1 /5)) and 
hence that is the only bound on A as well. 


6 Experimental Results 

We compare FrequentDirections to five different algorithms. The first two constitute brute force and 
naive baselines, described precisely next. The other three are common algorithms that are used in practice: 
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Table 1: Streaming Algorithms 


Streaming algorithms 

Paper 

Space 

Time 

Bound 

DKM06 

LinearTim 

-0| 

eSVD 

#R = 0(l/e 2 ) 

0((d+ l/e 2 )/e A ) 

0((d + l/e 2 )/e 4 + nnz(A)) 

P, eL 2 


#R = 0(k/e 2 ) 
0((k/e 2 ) 2 (d + k/e 2 )) 

0{(k/e 2 ) 2 (d + k/e 2 ) + 

nnz(A)) 

P, eA 

Sar06 [51 
turnstile 


#R = 0(k/e + klogk) 
0(d(k/e + klogk )) 

0(r\r\z(A)(k/e + klogk) + 
d(k/e + klogk) 2 )) 

^ 0(k/e-\-k log k) , > 

CW09 [ 

13 

I 

#R = O(kfe) 

0{nd 2 + ( ndk/e )) 

P0(fc/e)’ 

CW09 [ 

13 


0((n + d)(k/e )) 

0{nd 2 + (ndk/e)) 

C,eR 

FSS13 

determi 

24 

nis 

1 

tic 

0((dk/e ) logn) 

n((dk/e) logn) 0 ^ 

P, eR 

This paper 
deterministic 

#R = \l/e\ 

@(d/e) 

0(nd/e) 

covar-err < e 

#R = |" k/e + k] 

Q(dk/e) 

0(ndk/e) 

P, eR 


sampling, hashing, and random-projection, described in Section[5] All tested methods receive the rows of an 
n x d matrix A one by one. They are all limited in storage to an £ x d sketch matrix B and additional o(£d ) 
space for any auxiliary variables. This is with the exception of the brute force algorithm that requires 0(d 2 ) 
space. For a given input matrix A we compare the computational efficiency of the different methods and 
their resulting sketch accuracy. The computational efficiency is taken as the time required to produce B from 
the stream of A’s rows. The accuracy of a sketch matrix B is measured by both the covariance error and the 
projection error. Since some of the algorithms below are randomized, each algorithm was executed 5 times 
for each input parameter setting. The reported results are median values of these independent executions. 
All methods are implemented in python with NumPy and compiled by python 2.7 compiler, and experiments 
are performed on a linux machine with a 6 Core Intel (R) Xeon (R) 2.4GHz CPU and 128GB RAM. 

6.1 Competing algorithms 

Brute Force: The brute force approach produces the optimal rank i approximation of A. It explicitly com¬ 
putes the matrix A T A = Ya =t ^/A; by aggregating the outer products of the rows of A. The final ‘sketch’ 
consists of the top £ right singular vectors and values (square rooted) of A T A which are obtained by com¬ 
puting its SVD. The update time of Brute Force is @(d 2 ) and its space requirement is Q(d 2 ). 

Naive: Upon receiving a row in A the naive method does nothing. The sketch it returns is an all zeros £ by 
d matrix. This baseline is important for two reasons: First, it can actually be more accurate than random 
methods due to under sampling scaling issues. Second, although it does not perform any computation, it 
does incur computation overheads such as I/O exactly like the other methods. 

The other three baselines are Sampling, Hashing, and Random-projection, described in Section[5] 

6.2 Datasets 

We compare the performance of our algorithm on both synthetic and real datasets. For the synthetic dataset, 
each row of the generated input matrices, A, consists of an m dimensional signal and d dimensional noise 
(m <C d). More accurately, A = SDU + N/(. The signal coefficients matrix S £ M nxm is such that 
Sij ~ iV(0,1) i.i.d. The matrix D £ M mxm has only diagonal non-zero entered = l — (i—l)/m, which 
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Figure 2: Covariance error vs sketch size. On synthetic data with different signal dimension (m). m = 10 
(left), m = 20 (middle), and m = 50 (right). 





Figure 3: Projection error vs sketch size. On synthetic data with different signal dimension (m). m = 10 
(left), m = 20 (middle), and m = 50 (right). 


gives linearly diminishing signal singular values. The signal row space matrix U E M mx ' / contains a random 
m dimensional subspace in W l , for clarity, UU 1 = Id- The matrix SDU is exactly rank rri and constitutes 
the signal we wish to recover. The matrix N E W ixd contributes additive Gaussian noise Nij ~ N( 0,1). 
Due to | [54| , the spectral norms of SDU and N are expected to be the same up to some universal constant c\. 
Experimentally, c.\ « 1. Therefore, when £ < C| wc cannot expect to recover the signal because the noise 
spectrally dominates it. On the other hand, when £ > c\ the spectral norm is dominated by the signal which 
is therefore recoverable. Note that the Frobenius norm of A is dominated by the noise for any ( < C 2 yjd/m, 
for another constant close to 1, C 2 - Therefore, in the typical case where c\ < ( < 02 \f d/rri, the signal is 
recoverable by spectral methods even though the vast majority of the energy in each row is due to noise. In 
our experiments, we consider n E {10000, 20000,..., 100000} ( n = 10000 as default value), d = 1000, 
m- E {10, 20, 50} (m = 10 as default value), and £ E {5,10,15, 20} (( = 10 as default value). In projection 
eiTor experiments, we recover rank k = m of the dataset. 

We consider the real-world dataset Birds [1] in which each row represents an image of a bird, and each 
column a feature. This dataset has 11788 data points and 312 features, and we recover rank k = 100 for 
projection error experiments. 


6.3 Results 

The performance of FrequentDirections was measured both in terms of accuracy and running time 
compared to the above algorithms. In the first experiment, a moderately sized matrix (10,000 x 1,000) was 
approximated by each algorithm. The moderate input matrix size is needed to accommodate the brute force 
algorithm and to enable the exact error measure. The results are shown as we vary the data dimension m 
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Figure 4: Covariance error vs sketch size. On synthetic data with different noise ratio (rj). rj = 5 (left), 
77 = 15 (middle), and r/ = 20 (right). 



Sketch Size 


Figure 5: Projection error vs sketch size. On synthetic data with different noise ratio (rj). // = 5 (left), 
rj = 15 (middle), and 7 / = 20 (right). 


in Figure [2] for covariance error and Figure [3] for projection error; similar results are shown as the noise 
parameter rj is changed in Figure [4]for covariance error and Figure [5] for projection error. These give rise to 
a few interesting observations. First, all three random techniques actually perform worse in covariance error 
than naive for small sketch sizes. This is a side effect of under-sampling which causes overcorrection. This 
is not the case with FrequentDirections. Second, the three random techniques perform equally well. 
This might be a result of the chosen input. Nevertheless, practitioners should consider these as potentially 
comparable alternatives. Third, the curve indicated by “Frequent Direction Bound” plots the relative accu¬ 
racy guaranteed by FrequentDirections, which is equal to l/l in case of covariance error, and equal 
to l/(t — k) in case of projection error. Note that in covariance error plots,“Frequent Direction Bound” 
is consistently lower than the random methods. This means that the worst case performance guarantee is 
lower than the actual performance of the competing algorithms. Finally, FrequentDirections produces 
significantly more accurate sketches than predicted by its worst case analysis. 

The running time of FrequentDirections (specifically Algorithm [3TT]), however, is not better than its 
competitors. This is clearly predicted by their asymptotic running times. In Figure [6] as we vary the data 
dimension m and in Figure [7]as we vary the noise parameter rj, the running times (in seconds) of the sketch¬ 
ing algorithms are plotted as a function of their sketch sizes. Clearly, the larger the sketch, the higher the 
running time. Note that hashing is extremely fast. In fact, it is almost as fast as naive, which does nothing 
other than read the data! Sampling is also faster than Frequent Directions. Random-project is also faster 
than FrequentDirections, but only by a factor 3; this makes sense since they have the same asymptotic 
runtime, but random-projection just needs to add each new row to l other vectors while FrequentDirec¬ 
tions amortizes over more complex svd calls. It is important to stress that the implementations above are 
not carefully optimized. Different implementations might lead to different results. 

Similar results in error and runtime are shown on the real Birds data set in Figure[8] 
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Figure 6: Running time (seconds) vs sketch size. On synthetic data with different signal dimension (m). 
m = 10 (left), m = 20 (middle), and m = 50 (right). 
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Figure 7: Running time (seconds) vs sketch size. On synthetic data with different noise ratio (r ]). // = 5 
(left), = 10 (middle), and // = 15 (right). 





Figure 8: Covariance error (left). Projection error (middle) and Running time (right) on birds data. 


Nevertheless, we will claim that FrequentDirections scales well. Its running time is 0(ndi), which 
is linear in each of the three terms. In Figure [9] we fix the sketch size to be i = 100 and increase n and 
d. Note that the running time is indeed linear in both n and d as predicted. Moreover, sketching an input 
matrix of size 10 5 x 10 4 requires roughly 3 minutes. Assuming 4 byte floating point numbers, this matrix 
occupies roughly 4Gb of disk space. More importantly though, FrequentDirections is a streaming 
algorithm. Thus, its memory footprint is fixed and its running time is exactly linear in n. For example, 
sketching a 40Gb matrix of size 10 6 x 10 4 terminates in half an hour. The fact that FrequentDirections 
is also perfectly parallelizable (Section [3Tj ) makes FrequentDirections applicable to truly massive and 
distributed matrices. 
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Figure 9: Running time in seconds vs. input size. Flere, we measure only the running time of Frequent- 
Directions (specifically Algorithm [371]). The sketch size is kept fixed at i = 100. The size of the synthetic 
input matrix is n x d. The value of n is indicated on the x-axis. Different plot lines correspond to different 
values of d (indicated in the legend). The running time is measured in seconds and is indicated on the y-axis. 
It is clear from this plot that the running time of FrequentDirections is linear in both n and d. Note 
also that sketching a 10 5 x 10 4 dense matrix terminates in roughly 3 minutes. 
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